Scraping the data from one page:
Code Editor:
'''
We need four imports for this web-scrape.
'''
import pandas as pd #data science library
from bs4 import BeautifulSoup, Comment #HTML parser library
import requests #sleek HTTP library to interact with web-pages
import re #Regex library in Python
'''
The first thing we do here is request the web-page with the requests library.
This sends an HTTP GET request and returns the page's HTML source.
'''
page = requests.get('https://www.pro-football-reference.com/years/2000/')
'''
Next we pass the page content into a bs4 object, which parses the HTML so we
can extract content from it.
'''
soup = BeautifulSoup(page.content, 'lxml')
'''
The parsed soup contains tables. We can extract these tables into a list with
bs4's find_all() method, which finds the start and end tags of each table in
the soup object and collects them into a bs4 result set.
'''
tables = soup.find_all('table')
'''
So, if we print tables[0] we will see the first table in HTML format. We can now
convert it to a dataframe with Pandas.
'''
afc_standings_df = pd.read_html(str(tables))[0]
'''
We can do the same for the second table.
'''
nfc_standings_df = pd.read_html(str(tables))[1]
Python Console:
'''
We have now extracted and can view the first two dataframes.
'''
In[]: afc_standings_df.head()
Out[]:
Tm W L W-L% PF PA \
0 AFC East AFC East AFC East AFC East AFC East AFC East
1 Miami Dolphins* 11 5 .688 323 226
2 Indianapolis Colts+ 10 6 .625 429 326
3 New York Jets 9 7 .563 321 321
4 Buffalo Bills 8 8 .500 315 350
PD MoV SoS SRS OSRS DSRS
0 AFC East AFC East AFC East AFC East AFC East AFC East
1 97 6.1 1.0 7.1 0.0 7.1
2 103 6.4 1.5 7.9 7.1 0.8
3 0 0.0 3.5 3.5 1.4 2.2
4 -35 -2.2 2.2 0.0 0.5 -0.5
In[]: nfc_standings_df.head()
Out[]:
Tm W L W-L% PF PA \
0 NFC East NFC East NFC East NFC East NFC East NFC East
1 New York Giants* 12 4 .750 328 246
2 Philadelphia Eagles+ 11 5 .688 351 245
3 Washington Redskins 8 8 .500 281 269
4 Dallas Cowboys 5 11 .313 294 361
PD MoV SoS SRS OSRS DSRS
0 NFC East NFC East NFC East NFC East NFC East NFC East
1 82 5.1 -2.7 2.4 -1.3 3.8
2 106 6.6 -3.6 3.1 1.0 2.1
3 12 0.8 0.2 1.0 -2.9 3.8
4 -67 -4.2 -0.4 -4.6 -1.5 -3.0
Code Editor:
'''
Now let's scrape the Team Offense dataframe. To do this, start by extracting
all the comments from the bs4 object (the site hides these tables inside HTML
comments).
'''
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
'''
Next we create a regex object to target a table by its id.
The pattern looks for an opening table tag (<table), lazily matches any
characters up to the id attribute (.+?id="team_stats"), continues to the end
of the opening tag (.+?>), then matches any characters including newlines
([\s\S]+?), and ends with the closing table tag (</table>).
'''
rx = re.compile(r'<table.+?id="team_stats".+?>[\s\S]+?</table>')
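To sanity-check the pattern, here's a quick, hypothetical fragment shaped like the page's commented-out markup (the real comment text is much longer):

```python
import re

rx = re.compile(r'<table.+?id="team_stats".+?>[\s\S]+?</table>')

# made-up snippet resembling the markup hidden in an HTML comment
fragment = ('<!-- <table class="stats" id="team_stats" data-x="1">'
            '<tr><td>row</td></tr></table> -->')
match = rx.search(fragment)
print(match.group(0).startswith('<table'))    # True
print(match.group(0).endswith('</table>'))    # True
```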
'''
Then use a loop to iterate through each item in the comments object and apply
the regex pattern to it. Once the pattern matches, we break out of the loop.
'''
for c in comments:
try:
table = rx.search(c.string).group(0)
# break the loop if found
break
except:
pass
'''
The table is in HTML format. Convert it with Pandas. The output is a list,
so call it by its index—here it is 0.
'''
df_team_stats = pd.read_html(str(table))[0]
'''
Now let's get Passing Offense using the same methods.
'''
rx = re.compile(r'<table.+?id="passing".+?>[\s\S]+?</table>')
for c in comments:
try:
table = rx.search(c.string).group(0)
# break the loop if found
break
except:
pass
df_passing_stats = pd.read_html(str(table))[0]
Python Console:
In[]: df_team_stats.head()
Out[]:
Unnamed: 0_level_0 Unnamed: 1_level_0 Unnamed: 2_level_0 \
Rk Tm G
0 1.0 St. Louis Rams 16.0
1 2.0 Denver Broncos 16.0
2 3.0 Oakland Raiders 16.0
3 4.0 Indianapolis Colts 16.0
4 5.0 Minnesota Vikings 16.0
Unnamed: 3_level_0 Unnamed: 4_level_0 Tot Yds & TO \
PF Yds Ply Y/P TO
0 540.0 7075.0 1014.0 7.0 35.0
1 485.0 6554.0 1115.0 5.9 25.0
2 479.0 5776.0 1023.0 5.6 20.0
3 429.0 6141.0 1026.0 6.0 29.0
4 397.0 5961.0 958.0 6.2 28.0
Unnamed: 8_level_0 Unnamed: 9_level_0 Passing \
FL 1stD Cmp Att Yds TD Int
0 12.0 380.0 380.0 587.0 5232.0 37.0 23.0
1 13.0 383.0 354.0 569.0 4243.0 28.0 12.0
2 9.0 337.0 284.0 475.0 3306.0 28.0 11.0
3 14.0 357.0 357.0 571.0 4282.0 33.0 15.0
4 10.0 319.0 307.0 495.0 3832.0 33.0 18.0
Rushing Penalties \
NY/A 1stD Att Yds TD Y/A 1stD Pen Yds 1stPy
0 8.3 247.0 383.0 1843.0 26.0 4.8 112.0 111.0 942.0 21.0
1 7.1 223.0 516.0 2311.0 21.0 4.5 124.0 89.0 792.0 36.0
2 6.6 177.0 520.0 2470.0 23.0 4.8 128.0 118.0 940.0 32.0
3 7.2 213.0 435.0 1859.0 14.0 4.3 111.0 89.0 866.0 33.0
4 7.2 193.0 428.0 2129.0 14.0 5.0 107.0 106.0 908.0 19.0
Unnamed: 25_level_0 Unnamed: 26_level_0 Unnamed: 27_level_0
Sc% TO% EXP
0 46.8 16.7 49.41
1 38.3 11.7 -40.21
2 39.2 10.6 11.03
3 40.4 15.2 47.85
4 39.7 14.9 -60.12
In[]: df_passing_stats.head()
Out[]:
Rk Tm G Cmp Att Cmp% Yds TD TD% \
0 1.0 St. Louis Rams 16.0 380.0 587.0 64.7 5232.0 37.0 6.3
1 2.0 Indianapolis Colts 16.0 357.0 571.0 62.5 4282.0 33.0 5.8
2 3.0 Denver Broncos 16.0 354.0 569.0 62.2 4243.0 28.0 4.9
3 4.0 San Francisco 49ers 16.0 366.0 583.0 62.8 4239.0 32.0 5.5
4 5.0 Kansas City Chiefs 16.0 342.0 582.0 58.8 4149.0 29.0 5.0
Int Int% Lng Y/A AY/A Y/C Y/G Rate Sk Yds.1 NY/A ANY/A \
0 23.0 3.9 85.0 9.4 8.9 14.5 327.0 99.7 44.0 260.0 8.3 7.8
1 15.0 2.6 78.0 7.7 7.7 12.4 267.6 94.7 20.0 131.0 7.2 7.2
2 12.0 2.1 61.0 7.8 7.9 12.6 265.2 94.2 30.0 221.0 7.1 7.1
3 10.0 1.7 69.0 7.5 7.9 12.0 264.9 97.0 25.0 161.0 7.0 7.3
4 15.0 2.6 81.0 7.6 7.4 12.9 259.3 88.5 34.0 259.0 6.7 6.6
Sk% 4QC GWD EXP
0 7.0 1.0 3.0 120.25
1 3.4 2.0 3.0 88.16
2 5.0 2.0 3.0 37.99
3 4.1 NaN NaN 53.79
4 5.5 3.0 3.0 36.06
'''
Here notice there are unnamed column headers in df_team_stats. This is because
it is multi-indexed, meaning the headers in the first level describe groups of
headers in the second. Here's what they look like when you call the columns.
'''
In[]: df_team_stats.columns
Out[]:
MultiIndex([( 'Unnamed: 0_level_0', 'Rk'),
( 'Unnamed: 1_level_0', 'Tm'),
( 'Unnamed: 2_level_0', 'G'),
( 'Unnamed: 3_level_0', 'PF'),
( 'Unnamed: 4_level_0', 'Yds'),
( 'Tot Yds & TO', 'Ply'),
( 'Tot Yds & TO', 'Y/P'),
( 'Tot Yds & TO', 'TO'),
( 'Unnamed: 8_level_0', 'FL'),
( 'Unnamed: 9_level_0', '1stD'),
( 'Passing', 'Cmp'),
( 'Passing', 'Att'),
( 'Passing', 'Yds'),
( 'Passing', 'TD'),
( 'Passing', 'Int'),
( 'Passing', 'NY/A'),
( 'Passing', '1stD'),
( 'Rushing', 'Att'),
( 'Rushing', 'Yds'),
( 'Rushing', 'TD'),
( 'Rushing', 'Y/A'),
( 'Rushing', '1stD'),
( 'Penalties', 'Pen'),
( 'Penalties', 'Yds'),
( 'Penalties', '1stPy'),
('Unnamed: 25_level_0', 'Sc%'),
('Unnamed: 26_level_0', 'TO%'),
('Unnamed: 27_level_0', 'EXP')],
)
'''
We can flatten these to make the columns easier to work with and read. The
following joins every label in the first level to its label in the second
level with a "_".
'''
In[]: df_team_stats.columns = ['_'.join(col) for col in df_team_stats.columns]
In[]: df_team_stats.columns
Out[]:
Index(['Unnamed: 0_level_0_Rk', 'Unnamed: 1_level_0_Tm',
'Unnamed: 2_level_0_G', 'Unnamed: 3_level_0_PF',
'Unnamed: 4_level_0_Yds', 'Tot Yds & TO_Ply', 'Tot Yds & TO_Y/P',
'Tot Yds & TO_TO', 'Unnamed: 8_level_0_FL', 'Unnamed: 9_level_0_1stD',
'Passing_Cmp', 'Passing_Att', 'Passing_Yds', 'Passing_TD',
'Passing_Int', 'Passing_NY/A', 'Passing_1stD', 'Rushing_Att',
'Rushing_Yds', 'Rushing_TD', 'Rushing_Y/A', 'Rushing_1stD',
'Penalties_Pen', 'Penalties_Yds', 'Penalties_1stPy',
'Unnamed: 25_level_0_Sc%', 'Unnamed: 26_level_0_TO%',
'Unnamed: 27_level_0_EXP'],
dtype='object')
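The join itself is plain Python; on a couple of hypothetical level pairs it behaves like this:

```python
# each column is a (level_0, level_1) tuple; join the pair with "_"
cols = [('Unnamed: 0_level_0', 'Rk'), ('Passing', 'Cmp')]
flat = ['_'.join(col) for col in cols]
print(flat)   # ['Unnamed: 0_level_0_Rk', 'Passing_Cmp']
```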
'''
But we're not done fixing the columns. We still have to get rid of the
"Unnamed:" prefixes that the first level contributed to the join. We can do
that with a regex, so first we create a new pattern to match the unwanted
string data in the headers. We're trying to match the following, where n is
some digit: "Unnamed: n_level_n_" (the first number may be one or two digits).
'''
#The below means:
#match "Unnamed: " followed by one digit, or two digits from 00-59,
#then "_level_", any single digit, and a trailing underscore
In[]: re_col_pat = r'(Unnamed: ([0-9]|[0-5][0-9])_level_\d_)'
'''
Now we iterate through every column header, replace any match of re_col_pat
with an empty string, and append the result to a new list of column headers.
'''
In[]: new_cols = []
for i in df_team_stats.columns:
res = re.sub(re_col_pat , '', i)
new_cols.append(res)
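To see what the substitution does, here are a few hypothetical header strings run through the pattern:

```python
import re

re_col_pat = r'(Unnamed: ([0-9]|[0-5][0-9])_level_\d_)'

print(re.sub(re_col_pat, '', 'Unnamed: 0_level_0_Rk'))    # Rk
print(re.sub(re_col_pat, '', 'Unnamed: 25_level_0_Sc%'))  # Sc%
print(re.sub(re_col_pat, '', 'Passing_Cmp'))              # Passing_Cmp (no match)
```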
'''
Now we replace the old columns with the new ones by zipping both into a
dictionary and passing it to the rename() method. rename() in Pandas accepts
a mapping of old names to new names, so you can rename multiple columns at
once like someDf.rename(columns={old_col_1: new_col_1, old_col_2: new_col_2}).
'''
In[]: new_col_dict = dict(zip(df_team_stats.columns,new_cols))
In[]: df_team_stats.rename(columns=new_col_dict, inplace=True)
In[]: df_team_stats.columns
Out[]:
Index(['Rk', 'Tm', 'G', 'PF', 'Yds', 'Tot Yds & TO_Ply', 'Tot Yds & TO_Y/P',
'Tot Yds & TO_TO', 'FL', '1stD', 'Passing_Cmp', 'Passing_Att',
'Passing_Yds', 'Passing_TD', 'Passing_Int', 'Passing_NY/A',
'Passing_1stD', 'Rushing_Att', 'Rushing_Yds', 'Rushing_TD',
'Rushing_Y/A', 'Rushing_1stD', 'Penalties_Pen', 'Penalties_Yds',
'Penalties_1stPy', 'Sc%', 'TO%', 'EXP'],
dtype='object')
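As a tiny self-contained illustration of rename() with a mapping (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'old_a': [1], 'old_b': [2]})
# zip old names to new names into a dict, then rename in one call
new_col_dict = dict(zip(df.columns, ['a', 'b']))
df.rename(columns=new_col_dict, inplace=True)
print(list(df.columns))   # ['a', 'b']
```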
'''
Now we'll insert a year column, because eventually we're going to get all the
data from 2000-2020. This data is for 2000, and we'll want to be able to
differentiate it from other years.
'''
In[]: df_team_stats.insert(loc=0, column='Yr', value=2000)
In[]: df_team_stats.head()
Out[]:
Yr Rk Tm G PF Yds Tot Yds & TO_Ply \
0 2000 1.0 St. Louis Rams 16.0 540.0 7075.0 1014.0
1 2000 2.0 Denver Broncos 16.0 485.0 6554.0 1115.0
2 2000 3.0 Oakland Raiders 16.0 479.0 5776.0 1023.0
3 2000 4.0 Indianapolis Colts 16.0 429.0 6141.0 1026.0
4 2000 5.0 Minnesota Vikings 16.0 397.0 5961.0 958.0
Tot Yds & TO_Y/P Tot Yds & TO_TO FL 1stD Passing_Cmp Passing_Att \
0 7.0 35.0 12.0 380.0 380.0 587.0
1 5.9 25.0 13.0 383.0 354.0 569.0
2 5.6 20.0 9.0 337.0 284.0 475.0
3 6.0 29.0 14.0 357.0 357.0 571.0
4 6.2 28.0 10.0 319.0 307.0 495.0
Passing_Yds Passing_TD Passing_Int Passing_NY/A Passing_1stD \
0 5232.0 37.0 23.0 8.3 247.0
1 4243.0 28.0 12.0 7.1 223.0
2 3306.0 28.0 11.0 6.6 177.0
3 4282.0 33.0 15.0 7.2 213.0
4 3832.0 33.0 18.0 7.2 193.0
Rushing_Att Rushing_Yds Rushing_TD Rushing_Y/A Rushing_1stD \
0 383.0 1843.0 26.0 4.8 112.0
1 516.0 2311.0 21.0 4.5 124.0
2 520.0 2470.0 23.0 4.8 128.0
3 435.0 1859.0 14.0 4.3 111.0
4 428.0 2129.0 14.0 5.0 107.0
Penalties_Pen Penalties_Yds Penalties_1stPy Sc% TO% EXP
0 111.0 942.0 21.0 46.8 16.7 49.41
1 89.0 792.0 36.0 38.3 11.7 -40.21
2 118.0 940.0 32.0 39.2 10.6 11.03
3 89.0 866.0 33.0 40.4 15.2 47.85
4 106.0 908.0 19.0 39.7 14.9 -60.12
'''
There we have a working dataframe for offensive team stats for the year 2000.
Let's move on to get all the dataframes on the page without having to rewrite the
code for every one.
'''
Code Editor:
#LET'S WRITE A FUNCTION THAT TAKES THE TWO ARGUMENTS THAT DIFFER FOR EACH
#HTML PAGE HOUSING THE DATA: THE URL AND THE YEAR.
def get_all_data(url, yr):
#NEXT DO ALMOST EVERYTHING WE'VE ALREADY DONE.
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
#REG EX PATTERN FOR REMAPPING COLS
re_col_pat = r'(Unnamed: ([0-9]|[0-5][0-9])_level_\d_)'
#TEAM NAME CHANGE DICTIONARY
team_name_change_dict = {
'St. Louis Rams':'Los Angeles Rams',
'Oakland Raiders':'Las Vegas Raiders',
'San Diego Chargers':'Los Angeles Chargers',
'Washington Redskins':'Washington Football Team',
}
#CONVERT AVAILABLE AFC/NFC TABLES (NOT HIDDEN IN COMMENTS)
tables = soup.find_all('table') #standings table
#ITERATE THROUGH TABLES AND SCRAPE AFC/NFC STANDINGS; THE LOOP ITERATES
#TWICE, ONE FOR EACH TABLE IN TABLES
for i3 in range(len(tables)):
df_s = pd.read_html(str(tables))[i3]
df_s.columns = ['_'.join(col) for col in df_s.columns]
#NEW COL NAMES
new_cols = []
for i4 in df_s.columns:
res = re.sub(re_col_pat, '', i4)
new_cols.append(res)
#CREATE NEW COLS/OLD COLS DICT
new_col_dict = dict(zip(df_s.columns,new_cols))
#APPLY NEW COL NAMES WITH RENAME()
df_s.rename(columns=new_col_dict, inplace=True)
df_s.rename(columns={'T_m':'Tm'}, inplace=True)
#CHANGE TEAM NAMES SO THEY ARE UNIFORM
df_s.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
#INSERT YEAR COL
df_s.insert(loc=0, column='Yr', value=yr)
if i3 == 0:
df_afc_standings = df_s
elif i3 == 1:
df_nfc_standings = df_s
#NOW WE SCRAPE THE DATAFRAMES HIDDEN IN THE COMMENTS
#FIRST INSTANTIATE A LIST WITH ALL THE IDS OF THE HTML TABLES
df_str_lst = ['team_stats', 'passing', 'rushing',
'returns', 'kicking', 'team_scoring',
'team_conversions', 'drives', 'playoff_results']
#SCRAPE THE COMMENTS FROM THE PAGE SO WE CAN ITERATE THROUGH THEM AND GRAB THE DATAFRAMES
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
#ITERATE THROUGH DF_STR_LST WITH A FOR LOOP AND FORMAT THE ITEM ITERATING INTO THE REGEX
for i in range(len(df_str_lst)):
rx_compiler_str = r'<table.+?id="{}".+?>[\s\S]+?</table>'.format(df_str_lst[i])
#SAME AS BEFORE INSTANTIATE A REGEX COMPILED OBJECT AND ITERATE THROUGH THE COMMENTS
rx = re.compile(rx_compiler_str)
for comment in comments:
try:
table = rx.search(comment.string).group(0)
# break the loop if found
break
except:
pass
#CONVERT TABLE AND CLEAN THE DF
df = pd.read_html(str(table))[0]
df.columns = ['_'.join(col) for col in df.columns]
#NEW COL NAMES
new_cols = []
for i2 in df.columns:
res = re.sub(re_col_pat , '', i2)
new_cols.append(res)
#CREATE NEW COLS/OLD COLS DICT
new_col_dict = dict(zip(df.columns,new_cols))
#APPLY NEW COL NAMES WITH RENAME()
df.rename(columns=new_col_dict, inplace=True)
df.rename(columns={'T_m':'Tm'}, inplace=True)
#UPDATE TEAM NAMES SO THEY ARE ALL UNIFORM
df.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
#INSERT YEAR COL
df.insert(loc=0, column='Yr', value=yr)
#NOW USE AN IF STATEMENT THAT CHECKS WHICH TABLE ID THE ITERATION IS ON
#AND ASSIGN IT TO THE CORRESPONDING DATAFRAME
if i == 0:
df_team_stats = df
elif i == 1:
df_passing = df
elif i == 2:
df_rushing = df
elif i == 3:
df_returns = df
elif i == 4:
df_kicking = df
elif i == 5:
df_team_scoring = df
elif i == 6:
df_team_conversions = df
elif i == 7:
df_team_drives = df
elif i == 8:
df_playoff_results = df
#LASTLY RETURN ALL THE DATAFRAMES
return [df_afc_standings, df_nfc_standings,
df_team_stats, df_passing,
df_rushing, df_returns,
df_kicking, df_team_scoring,
df_team_conversions, df_team_drives,
df_playoff_results]
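A note on the team-name replacement inside the function: regex=True lets the mapping match inside longer strings, which matters because the standings tables append playoff markers (* and +) to team names. A minimal sketch with made-up rows:

```python
import pandas as pd

team_name_change_dict = {
    'St. Louis Rams': 'Los Angeles Rams',
    'Oakland Raiders': 'Las Vegas Raiders',
}
df = pd.DataFrame({'Tm': ['St. Louis Rams*', 'Oakland Raiders+']})
# regex=True replaces the matching substring, preserving the markers
df = df.replace(to_replace=team_name_change_dict, regex=True)
print(list(df['Tm']))   # ['Los Angeles Rams*', 'Las Vegas Raiders+']
```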
#call the function for the data; each item in the returned list is reassigned
#to a dataframe variable
#I named them with the suffix _mast to denote the master dataframes that will
#eventually hold all years
[df_afc_standings_mast, df_nfc_standings_mast,
df_team_stats_mast, df_passing_mast,
df_rushing_mast, df_returns_mast,
df_kicking_mast, df_team_scoring_mast,
df_team_conversions_mast, df_team_drives_mast,
df_playoff_results_mast] = get_all_data('https://www.pro-football-reference.com/years/2000/', 2000)
'''
Now we have all the dataframes on the page.
'''
In[]: df_passing_mast.head()
Out[]:
Yr R_k T_m G C_m_p A_t_t C_m_p_% Y_d_s T_D \
0 2000 1.0 Los Angeles Rams 16.0 380.0 587.0 64.7 5232.0 37.0
1 2000 2.0 Indianapolis Colts 16.0 357.0 571.0 62.5 4282.0 33.0
2 2000 3.0 Denver Broncos 16.0 354.0 569.0 62.2 4243.0 28.0
3 2000 4.0 San Francisco 49ers 16.0 366.0 583.0 62.8 4239.0 32.0
4 2000 5.0 Kansas City Chiefs 16.0 342.0 582.0 58.8 4149.0 29.0
T_D_% I_n_t I_n_t_% L_n_g Y_/_A A_Y_/_A Y_/_C Y_/_G R_a_t_e S_k \
0 6.3 23.0 3.9 85.0 9.4 8.9 14.5 327.0 99.7 44.0
1 5.8 15.0 2.6 78.0 7.7 7.7 12.4 267.6 94.7 20.0
2 4.9 12.0 2.1 61.0 7.8 7.9 12.6 265.2 94.2 30.0
3 5.5 10.0 1.7 69.0 7.5 7.9 12.0 264.9 97.0 25.0
4 5.0 15.0 2.6 81.0 7.6 7.4 12.9 259.3 88.5 34.0
Y_d_s_._1 N_Y_/_A A_N_Y_/_A S_k_% 4_Q_C G_W_D E_X_P
0 260.0 8.3 7.8 7.0 1.0 3.0 120.25
1 131.0 7.2 7.2 3.4 2.0 3.0 88.16
2 221.0 7.1 7.1 5.0 2.0 3.0 37.99
3 161.0 7.0 7.3 4.1 NaN NaN 53.79
4 259.0 6.7 6.6 5.5 3.0 3.0 36.06
In[]: df_team_drives_mast.head()
Out[]:
Yr Rk Tm G Dr Plays Sc% TO% \
0 2000 1.0 Los Angeles Chargers 16.0 206 1007 21.4 23.3
1 2000 2.0 New York Jets 16.0 205 1109 27.8 19.0
2 2000 3.0 Detroit Lions 16.0 204 1034 26.5 14.2
3 2000 4.0 Buffalo Bills 16.0 200 1118 29.0 10.5
4 2000 5.0 Jacksonville Jaguars 16.0 198 1113 34.8 13.6
Average Drive_Plays Average Drive_Yds Average Drive_Start \
0 4.9 20.8 Own 29.3
1 5.4 26.3 Own 33.2
2 5.1 21.6 Own 35.9
3 5.6 27.4 Own 29.4
4 5.6 28.7 Own 34.1
Average Drive_Time Average Drive_Pts
0 2:12 1.15
1 2:23 1.50
2 2:22 1.38
3 2:35 1.51
4 2:33 1.85
Code Editor:
import pandas as pd
from bs4 import BeautifulSoup, Comment
import requests, re
pd.options.display.max_columns = 50
pd.options.display.max_rows = 50
pd.set_option('max_colwidth', 50)
pd.set_option('max_rows', 50)
def get_all_data(url_lst):
#IF/ELSE STATEMENT CHECKS FOR A LIST OR STR
if isinstance(url_lst, list):
print('List is being processed...')
else:
url_lst = [url_lst]
print('processing...')
#REG EX PATTERN FOR REMAPPING COLS
re_col_pat = r'(Unnamed: ([0-9]|[0-5][0-9])_level_\d_)'
#REG EX PATTERN FOR GETTING THE YEAR FROM THE URL
re_year_pat = r'(\d\d\d\d)'
#TEAM NAME CHANGE DICTIONARY
team_name_change_dict = {
'St. Louis Rams':'Los Angeles Rams',
'Oakland Raiders':'Las Vegas Raiders',
'San Diego Chargers':'Los Angeles Chargers',
'Washington Redskins':'Washington Football Team',
}
#CREATE EMPTY MASTER DATAFRAMES AS PLACEHOLDERS
[df_afc_standings_mast, df_nfc_standings_mast,
df_team_stats_mast, df_passing_mast,
df_rushing_mast, df_returns_mast,
df_kicking_mast, df_team_scoring_mast,
df_team_conversions_mast, df_team_drives_mast,
df_playoff_results_mast] = [pd.DataFrame() for n in range(11)]
for n in url_lst:
#BS4 DATA PARSE
r = requests.get(n)
soup = BeautifulSoup(r.text, 'lxml')
#EXTRACT THE YEAR FROM THE URL STRING IN CURRENT ITERATION
yr = re.findall(re_year_pat, n)
#CONVERT AVAILABLE AFC/NFC TABLES (NOT HIDDEN IN COMMENTS)
tables = soup.find_all('table') #standings table
#####################################################################
#GRAB AFC/NFC STANDINGS
for i3 in range(len(tables)):
df_s = pd.read_html(str(tables))[i3]
df_s.columns = ['_'.join(col) for col in df_s.columns]
#NEW COL NAMES
new_cols = []
for i4 in df_s.columns:
res = re.sub(re_col_pat, '', i4)
new_cols.append(res)
#CREATE NEW COLS/OLD COLS DICT
new_col_dict = dict(zip(df_s.columns,new_cols))
#APPLY NEW COL NAMES WITH RENAME()
df_s.rename(columns=new_col_dict, inplace=True)
df_s.rename(columns={'T_m':'Tm'}, inplace=True)
#CHANGE TEAM NAMES SO THEY ARE UNIFORM
df_s.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
#INSERT YEAR COL
df_s.insert(loc=0, column='Yr', value=yr[0])
if i3 == 0:
df_afc_standings = df_s
df_afc_standings_mast = df_afc_standings_mast.append(df_afc_standings)
elif i3 == 1:
df_nfc_standings = df_s
df_nfc_standings_mast = df_nfc_standings_mast.append(df_nfc_standings)
#####################################################################
#GRAB THE DFs HIDDEN IN COMMENTS
df_str_lst = ['team_stats', 'passing', 'rushing',
'returns', 'kicking', 'team_scoring',
'team_conversions', 'drives', 'playoff_results']
#GET THE COMMENTS
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
#ITERATE THROUGH THE TABLE IDS
for i in range(len(df_str_lst)):
rx_compiler_str = r'<table.+?id="{}".+?>[\s\S]+?</table>'.format(df_str_lst[i])
#LOOK FOR THE TABLE WITH THE CURRENT ID
rx = re.compile(rx_compiler_str)
for comment in comments:
try:
table = rx.search(comment.string).group(0)
# break the loop if found
break
except:
pass
#CONVERT TABLE AND CLEAN THE DF COLS
df = pd.read_html(str(table))[0]
df.columns = ['_'.join(col) for col in df.columns]
#NEW COL NAMES
new_cols = []
for i2 in df.columns:
res = re.sub(re_col_pat , '', i2)
new_cols.append(res)
#CREATE NEW COLS/OLD COLS DICT
new_col_dict = dict(zip(df.columns,new_cols))
#APPLY NEW COL NAMES WITH RENAME()
df.rename(columns=new_col_dict, inplace=True)
df.rename(columns={'T_m':'Tm'}, inplace=True)
#CHANGE TEAM NAMES SO THEY ARE UNIFORM
df.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
#INSERT YEAR COL
df.insert(loc=0, column='Yr', value=yr[0])
#RETURN THE CORRECT DF TO CORRESPONDING MASTER DATASET PER ITERATION
if i == 0:
df_team_stats = df
df_team_stats_mast = df_team_stats_mast.append(df_team_stats)
elif i == 1:
df_passing = df
df_passing_mast = df_passing_mast.append(df_passing)
elif i == 2:
df_rushing = df
df_rushing_mast = df_rushing_mast.append(df_rushing)
elif i == 3:
df_returns = df
df_returns_mast = df_returns_mast.append(df_returns)
elif i == 4:
df_kicking = df
df_kicking_mast = df_kicking_mast.append(df_kicking)
elif i == 5:
df_team_scoring = df
df_team_scoring_mast = df_team_scoring_mast.append(df_team_scoring)
elif i == 6:
df_team_conversions = df
df_team_conversions_mast = df_team_conversions_mast.append(df_team_conversions)
elif i == 7:
df_team_drives = df
df_team_drives_mast = df_team_drives_mast.append(df_team_drives)
elif i == 8:
df_playoff_results = df
df_playoff_results_mast = df_playoff_results_mast.append(df_playoff_results)
#RETURN THE MASTER DATAFRAMES
return [df_afc_standings_mast, df_nfc_standings_mast,
df_team_stats_mast, df_passing_mast,
df_rushing_mast, df_returns_mast,
df_kicking_mast, df_team_scoring_mast,
df_team_conversions_mast, df_team_drives_mast,
df_playoff_results_mast]
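One caveat about this function: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0. On a newer pandas, the same accumulation pattern can be sketched by collecting the per-year frames in a list and concatenating once (hypothetical frames shown):

```python
import pandas as pd

yearly_frames = []                       # one entry per scraped season
for yr in (2000, 2001):                  # stand-in for the real URL loop
    df = pd.DataFrame({'Yr': [yr], 'Tm': ['Some Team']})
    yearly_frames.append(df)             # list append, not DataFrame.append
df_mast = pd.concat(yearly_frames, ignore_index=True)
print(list(df_mast['Yr']))   # [2000, 2001]
```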
'''
Now, because nobody wants to copy and paste 20 URLs into a list, we'll build
a list of all the years 2000 to 2020 with a list comprehension, then loop over
it to format each year into the link (the links are identical except for the
year) and append the result to a stand-alone URL list.
'''
year_lst = [i for i in range(2000,2021)]
link_lst = []
for i in range(len(year_lst)):
url = "http://www.pro-football-reference.com/years/{}/".format(year_lst[i])
link_lst.append(url)
del i
del url
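The year list and loop above can be collapsed into one comprehension that builds the same URLs:

```python
# same 21 season URLs, built in a single pass
link_lst = ["http://www.pro-football-reference.com/years/{}/".format(yr)
            for yr in range(2000, 2021)]
print(len(link_lst))    # 21
print(link_lst[-1])
```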
'''
Now that we have all our links pass them into the function.
'''
[df_afc_standings_mast, df_nfc_standings_mast,
df_team_stats_mast, df_passing_mast,
df_rushing_mast, df_returns_mast,
df_kicking_mast, df_team_scoring_mast,
df_team_conversions_mast, df_team_drives_mast,
df_playoff_results_mast] = get_all_data(link_lst)
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup, Comment
import requests, re
import string
pd.options.display.max_columns = 50
pd.options.display.max_rows = 50
pd.set_option('max_colwidth', 50)
pd.set_option('max_rows', 50)
def get_all_data(yr_lst):
#IF/ELSE STATEMENT CHECKS FOR A LIST OR STR
if isinstance(yr_lst, list):
print('List is being processed...')
else:
yr_lst = [yr_lst]
print('processing...')
link_lst_team_data = []
link_lst_div = []
link_lst_conf = []
url_sb_data = "https://www.pro-football-reference.com/super-bowl/"
for i in range(len(yr_lst)):
url_td = "http://www.pro-football-reference.com/years/{}/".format(yr_lst[i])
url_div = "https://www.pro-football-reference.com/years/{}/week_19.htm".format(yr_lst[i])
url_confs = "https://www.pro-football-reference.com/years/{}/week_20.htm".format(yr_lst[i])
link_lst_team_data.append(url_td)
link_lst_div.append(url_div)
link_lst_conf.append(url_confs)
#REG EX PATTERN FOR REMAPPING COLS
re_col_pat = r'(Unnamed: ([0-9]|[0-5][0-9])_level_\d_)'
#REG EX PATTERN FOR GETTING THE YEAR FROM THE URL
re_year_pat = r'(\d\d\d\d)'
#TEAM NAME CHANGE DICTIONARY
team_name_change_dict = {
'St. Louis Rams':'Los Angeles Rams',
'Oakland Raiders':'Las Vegas Raiders',
'San Diego Chargers':'Los Angeles Chargers',
'Washington Redskins':'Washington Football Team',
}
#CREATE EMPTY MASTER DATAFRAMES AS PLACEHOLDERS
[df_afc_standings_mast, df_nfc_standings_mast,
df_team_stats_mast, df_passing_mast,
df_rushing_mast, df_returns_mast,
df_kicking_mast, df_team_scoring_mast,
df_team_conversions_mast, df_team_drives_mast,
df_playoff_results_mast] = [pd.DataFrame() for n in range(11)]
#CREATE EMPTY MASTER DATAFRAMES AS PLACEHOLDERS
[df_div_mast] = [pd.DataFrame() for n in range(1)]
#CREATE EMPTY MASTER DATAFRAMES AS PLACEHOLDERS
[df_conf_mast] = [pd.DataFrame() for n in range(1)]
#CREATE EMPTY MASTER DATAFRAMES AS PLACEHOLDERS
[df_sb_champ_mast] = [pd.DataFrame() for n in range(1)]
for n in link_lst_team_data:
#BS4 DATA PARSE
r = requests.get(n)
soup = BeautifulSoup(r.text, 'lxml')
#EXTRACT THE YEAR FROM THE URL STRING IN CURRENT ITERATION
yr = re.findall(re_year_pat, n)
#CONVERT AVAILABLE AFC/NFC TABLES (NOT HIDDEN IN COMMENTS)
tables = soup.find_all('table') #standings table
#####################################################################
#GRAB AFC/NFC STANDINGS
for i3 in range(len(tables)):
df_s = pd.read_html(str(tables))[i3]
df_s.columns = ['_'.join(col) for col in df_s.columns]
#NEW COL NAMES
new_cols = []
for i4 in df_s.columns:
res = re.sub(re_col_pat, '', i4)
new_cols.append(res)
#CREATE NEW COLS/OLD COLS DICT
new_col_dict = dict(zip(df_s.columns,new_cols))
#APPLY NEW COL NAMES WITH RENAME()
df_s.rename(columns=new_col_dict, inplace=True)
df_s.rename(columns={'T_m':'Tm'}, inplace=True)
#CHANGE TEAM NAMES SO THEY ARE UNIFORM
df_s.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
#INSERT YEAR COL
df_s.insert(loc=0, column='Yr', value=yr[0])
if i3 == 0:
df_afc_standings = df_s
df_afc_standings_mast = df_afc_standings_mast.append(df_afc_standings)
elif i3 == 1:
df_nfc_standings = df_s
df_nfc_standings_mast = df_nfc_standings_mast.append(df_nfc_standings)
#####################################################################
#GRAB THE DFs HIDDEN IN COMMENTS
df_str_lst = ['team_stats', 'passing', 'rushing',
'returns', 'kicking', 'team_scoring',
'team_conversions', 'drives', 'playoff_results']
#GET THE COMMENTS
comments = soup.findAll(text=lambda text:isinstance(text, Comment))
#ITERATE THROUGH THE TABLE IDS
for i in range(len(df_str_lst)):
rx_compiler_str = r'<table.+?id="{}".+?>[\s\S]+?</table>'.format(df_str_lst[i])
#LOOK FOR THE TABLE WITH THE CURRENT ID
rx = re.compile(rx_compiler_str)
for comment in comments:
try:
table = rx.search(comment.string).group(0)
# break the loop if found
break
except:
pass
#CONVERT TABLE AND CLEAN THE DF COLS
df = pd.read_html(str(table))[0]
df.columns = ['_'.join(col) for col in df.columns]
#NEW COL NAMES
new_cols = []
for i2 in df.columns:
res = re.sub(re_col_pat , '', i2)
new_cols.append(res)
#CREATE NEW COLS/OLD COLS DICT
new_col_dict = dict(zip(df.columns,new_cols))
#APPLY NEW COL NAMES WITH RENAME()
df.rename(columns=new_col_dict, inplace=True)
df.rename(columns={'T_m':'Tm'}, inplace=True)
df.rename(columns={'R_k':'Rk'}, inplace=True)
#CHANGE TEAM NAMES SO THEY ARE UNIFORM
df.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
#DROP ANY SUMMARY ROWS: Avg Team, League Total, Avg Tm/G
drop_avgs_lst = ['Avg Team', 'League Total', 'Avg Tm/G']
try:
df = df.loc[~df['Tm'].isin(drop_avgs_lst)]
except:
pass
#INSERT YEAR COL
df.insert(loc=0, column='Yr', value=yr[0])
#RETURN THE CORRECT DF TO CORRESPONDING MASTER DATASET PER ITERATION
if i == 0:
df_team_stats = df
df_team_stats_mast = df_team_stats_mast.append(df_team_stats)
elif i == 1:
df_passing = df
df_passing_mast = df_passing_mast.append(df_passing)
elif i == 2:
df_rushing = df
df_rushing_mast = df_rushing_mast.append(df_rushing)
elif i == 3:
df_returns = df
df_returns_mast = df_returns_mast.append(df_returns)
elif i == 4:
df_kicking = df
df_kicking_mast = df_kicking_mast.append(df_kicking)
elif i == 5:
df_team_scoring = df
df_team_scoring_mast = df_team_scoring_mast.append(df_team_scoring)
elif i == 6:
df_team_conversions = df
df_team_conversions_mast = df_team_conversions_mast.append(df_team_conversions)
elif i == 7:
df_team_drives = df
df_team_drives_mast = df_team_drives_mast.append(df_team_drives)
elif i == 8:
df_playoff_results = df
df_playoff_results_mast = df_playoff_results_mast.append(df_playoff_results)
#parse division URLs
try:
for n in link_lst_div:
#BS4 DATA PARSE
r = requests.get(n)
soup = BeautifulSoup(r.text, 'lxml')
#EXTRACT THE YEAR FROM THE URL STRING IN CURRENT ITERATION
yr = re.findall(re_year_pat, n)
#CONVERT AVAILABLE AFC/NFC TABLES (NOT HIDDEN IN COMMENTS)
champ_tables_div = soup.find_all('table') #standings table
for i in range(0,8,2):
df_s = pd.read_html(str(champ_tables_div))[i]
#INSERT YEAR COL
df_s.insert(loc=0, column='Yr', value=yr[0])
if i == 0:
df_div_0 = df_s
df_div_mast = df_div_mast.append(df_div_0)
elif i == 2:
df_div_1 = df_s
df_div_mast = df_div_mast.append(df_div_1)
elif i == 4:
df_div_2 = df_s
df_div_mast = df_div_mast.append(df_div_2)
elif i == 6:
df_div_3 = df_s
df_div_mast = df_div_mast.append(df_div_3)
#CHANGE TEAM NAMES SO THEY ARE UNIFORM
df_div_mast.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
except:
df_div_mast.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
print('no division table found')
#Parse conf champ URLs
try:
for n in link_lst_conf:
#BS4 DATA PARSE
r = requests.get(n)
soup = BeautifulSoup(r.text, 'lxml')
#EXTRACT THE YEAR FROM THE URL STRING IN CURRENT ITERATION
yr = re.findall(re_year_pat, n)
#CONVERT AVAILABLE AFC/NFC TABLES (NOT HIDDEN IN COMMENTS)
champ_tables_conf = soup.find_all('table') #standings table
for i in range(0,4,2):
df_s = pd.read_html(str(champ_tables_conf))[i]
#INSERT YEAR COL
df_s.insert(loc=0, column='Yr', value=yr[0])
if i == 0:
df_conf_0 = df_s
df_conf_mast = df_conf_mast.append(df_conf_0)
elif i == 2:
df_conf_1 = df_s
df_conf_mast = df_conf_mast.append(df_conf_1)
#CHANGE TEAM NAMES SO THEY ARE UNIFORM
df_conf_mast.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
except:
df_conf_mast.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
print('no conference table found')
#parse super bowl URL
#BS4 DATA PARSE
sb_df_link = requests.get(url_sb_data)
soup = BeautifulSoup(sb_df_link.text, 'lxml')
#EXTRACT THE YEAR FROM THE URL STRING IN CURRENT ITERATION
yr = re.findall(re_year_pat, n)
#CONVERT AVAILABLE AFC/NFC TABLES (NOT HIDDEN IN COMMENTS)
sb_data = soup.find_all('table') #standings table
df_sb_mast = pd.read_html(str(sb_data))[0]
df_sb_mast['Date'] = pd.to_datetime(df_sb_mast['Date'])
df_sb_mast['Yr'] = df_sb_mast['Date'].dt.year
#CHANGE TEAM NAMES SO THEY ARE UNIFORM
df_sb_mast.replace(to_replace=team_name_change_dict, regex=True, inplace=True)
print('Done.')
#RETURN THE MASTER DATAFRAMES (ORDER HERE MATTERS)
return [df_afc_standings_mast, df_nfc_standings_mast,
df_team_stats_mast, df_passing_mast,
df_rushing_mast, df_returns_mast,
df_kicking_mast, df_team_scoring_mast,
df_team_conversions_mast, df_team_drives_mast,
df_playoff_results_mast,
df_sb_mast, df_div_mast, df_conf_mast]
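The Super Bowl year extraction relies on pandas parsing the Date column; on hypothetical date strings it works like this:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['Feb 3 2002', 'Feb 1 2004']})
df['Date'] = pd.to_datetime(df['Date'])   # parse strings to Timestamps
df['Yr'] = df['Date'].dt.year             # pull the year out
print(list(df['Yr']))   # [2002, 2004]
```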
#PASS THE YEARS THROUGH THE FUNCTION
[df_afc_standings_mast, df_nfc_standings_mast,
df_team_stats_mast, df_passing_mast,
df_rushing_mast, df_returns_mast,
df_kicking_mast, df_team_scoring_mast,
df_team_conversions_mast, df_team_drives_mast,
df_playoff_results_mast,
df_sb_champ_mast, df_div_mast, df_conf_mast] = get_all_data([i for i in range(2000,2023)])
'''
It is good practice to avoid changing the master data sets, so we'll leave
the originals untouched. One of the first things we're going to do is combine
the master standings dataframes, re-assigning the result to a new dataframe.
'''
#IMPORTS
import pandas as pd
import numpy as np
import string
#PANDAS VIEW
pd.options.display.max_columns = None
pd.options.display.max_rows = None
pd.set_option('max_colwidth', None)
pd.set_option('max_rows', None)
pd.set_option('display.expand_frame_repr', False)
import matplotlib.pyplot as plt # To visualize
import seaborn as sns # To visualize
pd.options.mode.chained_assignment = None # default='warn'
'''
Here is a function to add suffixes to the columns of a dataframe. When we
merge or append dataframes that share column names, this will help
differentiate them.
'''
def add_suffix(df, suf_str):
keep_same = {'Yr', 'Tm'}
df.columns = ['{}{}'.format(c, '' if c in keep_same else suf_str)
for c in df.columns]
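A quick demonstration on a hypothetical frame (Yr and Tm are left alone so they can serve as merge keys):

```python
import pandas as pd

def add_suffix(df, suf_str):
    keep_same = {'Yr', 'Tm'}
    df.columns = ['{}{}'.format(c, '' if c in keep_same else suf_str)
                  for c in df.columns]

df = pd.DataFrame({'Yr': [2000], 'Tm': ['Some Team'], 'W': [11]})
add_suffix(df, '_afc')
print(list(df.columns))   # ['Yr', 'Tm', 'W_afc']
```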
'''
First, import the master dataframes. This is not the most efficient way but this
is just how I did it.
'''
def data_import():
df_afc_standings_mast = pd.read_csv(r'Path\to\data\df_afc_standings_mast.csv', engine='python', index_col=False)
df_kicking_mast = pd.read_csv(r'Path\to\data\df_kicking_mast.csv', engine='python', index_col=False)
df_nfc_standings_mast = pd.read_csv(r'Path\to\data\df_nfc_standings_mast.csv', engine='python', index_col=False)
df_passing_mast = pd.read_csv(r'Path\to\data\df_passing_mast.csv', engine='python', index_col=False)
df_playoff_results_mast = pd.read_csv(r'Path\to\data\df_playoff_results_mast.csv', engine='python', index_col=False)
df_returns_mast = pd.read_csv(r'Path\to\data\df_returns_mast.csv', engine='python', index_col=False)
df_rushing_mast = pd.read_csv(r'Path\to\data\df_rushing_mast.csv', engine='python', index_col=False)
df_team_conversions_mast = pd.read_csv(r'Path\to\data\df_team_conversions_mast.csv', engine='python', index_col=False)
df_team_drives_mast = pd.read_csv(r'Path\to\data\df_team_drives_mast.csv', engine='python', index_col=False)
df_team_scoring_mast = pd.read_csv(r'Path\to\data\df_team_scoring_mast.csv', engine='python', index_col=False)
df_team_stats_mast = pd.read_csv(r'Path\to\data\df_team_stats_mast.csv', engine='python', index_col=False)
df_div_mast = pd.read_csv(r'Path\to\data\df_div_mast.csv', engine='python', index_col=False)
df_conf_mast = pd.read_csv(r'Path\to\data\df_conf_mast.csv', engine='python', index_col=False)
df_sb_champ_mast = pd.read_csv(r'Path\to\data\df_sb_champ_mast.csv', engine='python', index_col=False)
return [df_afc_standings_mast, df_nfc_standings_mast,
df_team_stats_mast, df_passing_mast,
df_rushing_mast, df_returns_mast,
df_kicking_mast, df_team_scoring_mast,
df_team_conversions_mast, df_team_drives_mast,
df_playoff_results_mast, df_div_mast,
df_conf_mast, df_sb_champ_mast]
[df_afc_standings_mast, df_nfc_standings_mast,
df_team_stats_mast, df_passing_mast,
df_rushing_mast, df_returns_mast,
df_kicking_mast, df_team_scoring_mast,
df_team_conversions_mast, df_team_drives_mast,
df_playoff_results_mast, df_div_mast,
df_conf_mast, df_sb_champ_mast] = data_import()
'''
Next, clean the standings data. This data will be the base of our analysis dataframe.
'''
def clean_standings():
team_names_lst = ['Los Angeles Rams', 'Denver Broncos', 'Las Vegas Raiders',
'Indianapolis Colts', 'Minnesota Vikings', 'Tampa Bay Buccaneers',
'San Francisco 49ers', 'Jacksonville Jaguars',
'Kansas City Chiefs', 'New Orleans Saints', 'Green Bay Packers',
'Philadelphia Eagles', 'Tennessee Titans', 'Baltimore Ravens',
'New York Giants', 'Miami Dolphins', 'Pittsburgh Steelers',
'New York Jets', 'Seattle Seahawks', 'Buffalo Bills',
'Carolina Panthers', 'Detroit Lions', 'Dallas Cowboys',
'Washington Football Team', 'New England Patriots',
'Los Angeles Chargers', 'Atlanta Falcons', 'Chicago Bears',
'Arizona Cardinals', 'Cincinnati Bengals', 'Cleveland Browns',
'Houston Texans']
#PUT THE STANDINGS DFs TOGETHER (DataFrame.append WAS REMOVED IN pandas 2.0)
df_wins_losses = pd.concat([df_afc_standings_mast, df_nfc_standings_mast])
#STRIP ANY PUNCTUATION
df_wins_losses['Tm'] = df_wins_losses['Tm'].str.strip(string.punctuation)
#DROP ANY ROW THAT IS NOT A TEAM
df_wins_losses = df_wins_losses.loc[df_wins_losses['Tm'].isin(team_names_lst)]
#ISOLATE THE STANDINGS COLUMNS WE NEED (THE KEYS PLUS WINS, LOSSES, AND W-L%)
df_standings_wins_loss_percent = df_wins_losses.iloc[0:,0:5]
#SELF-MERGE ON Yr AND Tm SO THE OVERLAPPING COLUMNS PICK UP THE '_standings' SUFFIX
df_team_stats_mast_w_percent = df_standings_wins_loss_percent.merge(df_standings_wins_loss_percent, suffixes=('_standings','_drop'), how='left', on=['Yr','Tm'])
#KEEP THE SUFFIXED COLUMNS AS A FOUNDATIONAL DF TO COMPARE WL% TO OTHER VARIABLES
df1 = df_team_stats_mast_w_percent.iloc[0:,0:5]
return df1
#DECLARES THE STANDINGS DATA AS df
df = clean_standings()
'''
Then, drop all of the Rk and G columns from the dataframes. We do not need them.
'''
#DROP ALL THE RANKS AND GAMES COLUMNS
def dropper():
df_passing = df_passing_mast.drop(columns=['Rk','G'])
add_suffix(df_passing, '_passing')
df_kicking = df_kicking_mast.drop(columns=['Rk', 'G'])
add_suffix(df_kicking, '_kicking')
df_returns = df_returns_mast.drop(columns=['Rk', 'G'])
add_suffix(df_returns, '_returns')
df_rushing = df_rushing_mast.drop(columns=['Rk', 'G'])
add_suffix(df_rushing, '_rushing')
df_conversions = df_team_conversions_mast.drop(columns=['Rk', 'G'])
add_suffix(df_conversions, '_conversions')
df_drives = df_team_drives_mast.drop(columns=['Rk', 'G'])
add_suffix(df_drives, '_drives')
df_scoring = df_team_scoring_mast.drop(columns=['Rk', 'G'])
add_suffix(df_scoring, '_scoring')
df_team_stats = df_team_stats_mast.drop(columns=['Rk','G'])
add_suffix(df_team_stats, '_team_offense')
df_drop_lst = [df_passing, df_kicking,
df_returns, df_rushing,
df_conversions, df_drives,
df_scoring, df_team_stats]
return df_drop_lst
#THIS PLACES THE DFs RETURNED INTO A LIST
df_merger_lst = dropper()
#THIS MERGES ALL THE DATAFRAMES INTO df ON Tm AND Yr.
for i in df_merger_lst:
df = df.merge(i, on=['Tm','Yr'])
'''
Many of the columns hold object (string) data that we need as numerics.
get_sec() converts a 'minutes:seconds' string into total seconds.
'''
def get_sec(time_str):
m, s = time_str.split(':')
return int(m) * 60 + int(s)
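'''
For example (the converter is repeated here so the snippet runs on its own):
'''

```python
def get_sec(time_str):
    # Same converter as above: a 'minutes:seconds' string to total seconds
    m, s = time_str.split(':')
    return int(m) * 60 + int(s)

print(get_sec('2:39'))  # 159
print(get_sec('0:45'))  # 45
```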
def objs_to_numerics():
df['Avg_Drive_Time_Sec_drives'] = [get_sec(i) for i in df['Average Drive_Time_drives']]
df.drop(columns=['Average Drive_Time_drives'],inplace=True)
#FIRST FIND THEM AND PUT THEM IN A LIST
obj_cols = []
for i in df.iloc[0:,2:].columns:
if df[i].dtype == object:
obj_cols.append(i)
else:
pass
#NOW THAT WE HAVE THEM STRIP THE % SIGN FROM PERCENT COLS
for i in obj_cols:
df[i] = df[i].str.strip(string.punctuation)
df[i] = df[i].str.strip(r'Own ') #all are 'Own '
df[i] = pd.to_numeric(df[i])
objs_to_numerics()
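'''
The stripping above only trims punctuation from the ends of each string, which
is why a value like '.688' becomes the integer 688 (and gets divided by 1000
later). A small sketch with hypothetical values:
'''

```python
import string
import pandas as pd

# str.strip(string.punctuation) trims the ends only: '.688' loses its leading
# dot, while '38.5%' keeps its interior dot and only loses the trailing '%'
s = pd.Series(['.688', '38.5%', 'Own 27.4'])
s = s.str.strip(string.punctuation)
s = s.str.strip('Own ')  # strips the characters 'O', 'w', 'n', ' ' from the ends
print(pd.to_numeric(s).tolist())  # [688.0, 38.5, 27.4]
```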
'''
We need boolean columns that mark whether a team made it to the divisional round,
the conference championship, and/or the Super Bowl. We also want to track whether
each team won its divisional or conference game.
'''
def wrangle_div_conf(some_df, tm_col):
team_names_lst = ['Los Angeles Rams', 'Denver Broncos', 'Las Vegas Raiders',
'Indianapolis Colts', 'Minnesota Vikings', 'Tampa Bay Buccaneers',
'San Francisco 49ers', 'Jacksonville Jaguars',
'Kansas City Chiefs', 'New Orleans Saints', 'Green Bay Packers',
'Philadelphia Eagles', 'Tennessee Titans', 'Baltimore Ravens',
'New York Giants', 'Miami Dolphins', 'Pittsburgh Steelers',
'New York Jets', 'Seattle Seahawks', 'Buffalo Bills',
'Carolina Panthers', 'Detroit Lions', 'Dallas Cowboys',
'Washington Football Team', 'New England Patriots',
'Los Angeles Chargers', 'Atlanta Falcons', 'Chicago Bears',
'Arizona Cardinals', 'Cincinnati Bengals', 'Cleveland Browns',
'Houston Texans']
#.loc ALL THE ROWS THAT CONTAIN THE TEAMS
some_df = some_df.loc[some_df[tm_col].isin(team_names_lst)]
#CONVERT THE SCORES FROM AN OBJECT STRING TO INTEGERS
some_df['1'] = some_df['1'].astype(int)
#DROP THE FINALS COLUMN
some_df.drop(columns='2',inplace=True)
#REPLACE THE COL NAMES
some_df.rename(columns={'0':'Tm', '1':'score'}, inplace=True)
#BOOLEAN INDEX THE WINNER AND LOSER IN A COL
#NOW SEPARATE THE SCORES INTO EVERY OTHER ROW; EASIER TO WORK WITH
df_div_n2 = list(some_df['score'].iloc[0::2])
df_div_n3 = list(some_df['score'].iloc[1::2])
#ITERATE THROUGH THE LISTS TO GET TRUE FOR WIN; AND FALSE FOR LOSS
win_loss_bool = []
for i in range(len(df_div_n2)):
win_loss_bool.append(df_div_n2[i] > df_div_n3[i])
win_loss_bool.append(df_div_n2[i] < df_div_n3[i])
#ADD A WIN_BOOL COLUMN TO THE DF
some_df['win_bool'] = win_loss_bool
return some_df
df_div = wrangle_div_conf(df_div_mast, '0')
df_conf = wrangle_div_conf(df_conf_mast, '0')
#RENAME THE COLUMNS IN DIVISION AND CONFERENCE DATA
df_div.rename(columns={'score':'score_div', 'win_bool':'win_bool_div'}, inplace=True)
df_conf.rename(columns={'score':'score_conf', 'win_bool':'win_bool_conf'}, inplace=True)
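'''
The every-other-row pairing above is easier to see on hypothetical scores:
rows in the scraped layout alternate between the two sides of each game, so
row 0 plays row 1, row 2 plays row 3, and so on.
'''

```python
# Hypothetical scores in the scraped layout: consecutive rows are opponents
scores = [24, 17, 31, 34]

evens = scores[0::2]  # first team of each pairing
odds = scores[1::2]   # second team of each pairing

win_loss_bool = []
for a, b in zip(evens, odds):
    win_loss_bool.append(a > b)  # first team's result
    win_loss_bool.append(a < b)  # second team's result

print(win_loss_bool)  # [True, False, False, True]
```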
'''
Now we have to add to df whether each team went to the conference championship and/or the divisional round.
'''
#GET THE TM YR COLUMNS FROM DF IN A STANDALONE DATAFRAME
df_ana = df.iloc[0:,0:2]
#SET THE Yr AND Tm COLUMNS AS THE INDEX OF EACH DATAFRAME
df.set_index(['Yr','Tm'], inplace=True)
df_div.set_index(['Yr','Tm'], inplace=True)
df_conf.set_index(['Yr','Tm'], inplace=True)
#MERGE THE DATAFRAMES
df = df.merge(df_div, left_index=True, right_index=True, how='left')
df = df.merge(df_conf, left_index=True, right_index=True, how='left')
#CONVERT W-L% BACK TO A PROPORTION (THE LEADING '.' WAS STRIPPED WITH THE PUNCTUATION)
df['W_-_L_%_standings'] = df['W_-_L_%_standings'] / 1000
'''
Now we must clean the Super Bowl data.
'''
def clean_sb_data():
#ISOLATE THE COLUMNS WE NEED
df_sb = df_sb_champ_mast[['Yr', 'Winner', 'Pts', 'Loser', 'Pts.1']]
df_sb['Yr'] = df_sb['Yr']-1
#SEPARATE THE WINNERS AND LOSERS
sb_loser = df_sb[['Yr', 'Loser', 'Pts.1']]
sb_winner = df_sb.iloc[0:,0:3]
#CHANGE THE WINNER AND LOSER DATAFRAMES TEAM COLUMN TO TM AND PTS TO SCORE
sb_loser.rename(columns={'Loser':'Tm', 'Pts.1':'score'}, inplace=True)
sb_winner.rename(columns={'Winner':'Tm', 'Pts':'score'}, inplace=True)
#CONCAT THE SEPARATED DATAFRAMES
conc_sb_dfs = [sb_winner, sb_loser]
df_sb = pd.concat(conc_sb_dfs)
#RESET THE INDEX OF THE DATAFRAME
df_sb.reset_index(drop=True, inplace=True)
#ADD A WIN LOSS BOOL COLUMN; THE WINNERS WERE CONCATENATED FIRST, SO THE FIRST 55 ROWS ARE WINS
df_sb['win_loss_bool'] = df_sb.index < 55
#LOC IN THE YEARS 2000 TO 2020
df_sb = df_sb.loc[df_sb['Yr'] >= 2000]
#ADD A SUFFIX TO THE COLUMNS
add_suffix(df_sb, '_sb')
#SET THE INDEX TO MATCH DF
df_sb.set_index(['Yr', 'Tm'], inplace=True)
return df_sb
df_sb = clean_sb_data()
df = df.merge(df_sb, left_index=True, right_index=True, how='left')
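'''
Because only the two Super Bowl teams of each season appear in df_sb, the
how='left' merge leaves NaN in the Super Bowl columns for every other team.
A minimal sketch of that behavior on hypothetical index-aligned frames (the
2000 Ravens did play in the Super Bowl; the Bills did not):
'''

```python
import pandas as pd

# Every team appears on the left; only Super Bowl teams appear on the right
left = pd.DataFrame(
    {'W': [12, 8]},
    index=pd.MultiIndex.from_tuples(
        [(2000, 'Baltimore Ravens'), (2000, 'Buffalo Bills')],
        names=['Yr', 'Tm']))
right = pd.DataFrame(
    {'score_sb': [34]},
    index=pd.MultiIndex.from_tuples(
        [(2000, 'Baltimore Ravens')], names=['Yr', 'Tm']))

merged = left.merge(right, left_index=True, right_index=True, how='left')
print(merged['score_sb'].isna().tolist())  # [False, True]
```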
'''
Now we have to organize a new dataframe with the division, conference, and
Super Bowl data at the forefront.
'''
#Start with a function that cleanly prints the columns of df in rows of three;
#this makes the long column list easier to work with
def get_df_cols():
idx_v = 0
try:
for i in range(len(df.columns)):
print("'{}','{}','{}',".format(df.columns[idx_v+0], df.columns[idx_v+1], df.columns[idx_v+2]))
idx_v += 3
except IndexError:
#the column count is not always a multiple of three
pass
'''
Finally, corral it all into one big dataframe.
'''
def wrangle_ana_df():
'''
This re-orders and re-assigns df.
'''
df_ana = df[['score_div', 'win_bool_div', 'score_conf',
'win_bool_conf', 'score_sb', 'win_loss_bool_sb',
'W_standings','L_standings','W_-_L_%_standings',
'C_m_p_passing','A_t_t_passing','C_m_p_%_passing',
'Y_d_s_passing','T_D_passing','T_D_%_passing',
'I_n_t_passing','I_n_t_%_passing','L_n_g_passing',
'Y_/_A_passing','A_Y_/_A_passing','Y_/_C_passing',
'Y_/_G_passing','R_a_t_e_passing','S_k_passing',
'Y_d_s_._1_passing','S_k_%_passing','N_Y_/_A_passing',
'A_N_Y_/_A_passing','4_Q_C_passing','G_W_D_passing',
'E_X_P_passing','0-19_FGA_kicking','0-19_FGM_kicking',
'20-29_FGA_kicking','20-29_FGM_kicking','30-39_FGA_kicking',
'30-39_FGM_kicking','40-49_FGA_kicking','40-49_FGM_kicking',
'50+_FGA_kicking','50+_FGM_kicking','Scoring_FGA_kicking',
'Scoring_FGM_kicking','Scoring_Lng_kicking','Scoring_FG%_kicking',
'Scoring_XPA_kicking','Scoring_XPM_kicking','Scoring_XP%_kicking',
'Kickoffs_KO_kicking','Kickoffs_KOYds_kicking','Kickoffs_TB_kicking',
'Kickoffs_TB%_kicking','Kickoffs_KOAvg_kicking','Punting_Pnt_kicking',
'Punting_Yds_kicking','Punting_Lng_kicking','Punting_Blck_kicking',
'Punting_Y/P_kicking','Punt Returns_Ret_returns','Punt Returns_Yds_returns',
'Punt Returns_TD_returns','Punt Returns_Lng_returns','Punt Returns_Y/R_returns',
'Kick Returns_Rt_returns','Kick Returns_Yds_returns','Kick Returns_TD_returns',
'Kick Returns_Lng_returns','Kick Returns_Y/Rt_returns','APYd_returns',
'A_t_t_rushing','Y_d_s_rushing','T_D_rushing',
'L_n_g_rushing','Y_/_A_rushing','Y_/_G_rushing',
'F_m_b_rushing','E_X_P_rushing','Downs_3DAtt_conversions',
'Downs_3DConv_conversions','Downs_3D%_conversions','Downs_4DAtt_conversions',
'Downs_4DConv_conversions','Downs_4D%_conversions','Red Zone_RZAtt_conversions',
'Red Zone_RZTD_conversions','Red Zone_RZPct_conversions','#Dr_drives',
'Plays_drives','Sc%_drives','TO%_drives',
'Average Drive_Plays_drives','Average Drive_Yds_drives','Average Drive_Start_drives',
'Average Drive_Pts_drives','R_s_h_T_D_scoring','R_e_c_T_D_scoring',
'P_R_ _T_D_scoring','K_R_ _T_D_scoring','F_b_l_T_D_scoring',
'I_n_t_T_D_scoring','O_t_h_T_D_scoring','A_l_l_T_D_scoring',
'2_P_M_scoring','2_P_A_scoring','X_P_M_scoring',
'X_P_A_scoring','F_G_M_scoring','F_G_A_scoring',
'S_f_t_y_scoring','P_t_s_scoring','P_t_s_/_G_scoring',
'D_2_P_scoring','PF_team_offense','Yds_team_offense',
'Tot Yds & TO_Ply_team_offense','Tot Yds & TO_Y/P_team_offense','Tot Yds & TO_TO_team_offense',
'FL_team_offense','1stD_team_offense','Passing_Cmp_team_offense',
'Passing_Att_team_offense','Passing_Yds_team_offense','Passing_TD_team_offense',
'Passing_Int_team_offense','Passing_NY/A_team_offense','Passing_1stD_team_offense',
'Rushing_Att_team_offense','Rushing_Yds_team_offense','Rushing_TD_team_offense',
'Rushing_Y/A_team_offense','Rushing_1stD_team_offense','Penalties_Pen_team_offense',
'Penalties_Yds_team_offense','Penalties_1stPy_team_offense','Sc%_team_offense',
'TO%_team_offense','EXP_team_offense','Avg_Drive_Time_Sec_drives',]]
#Convert all the data to floats
df_ana = df_ana.iloc[0:,0:].astype(float)
#sort the index of ana_df
df_ana.sort_index(inplace=True)
#reset the index and get the Yr and Tm cols back into the dataframe
df_ana.reset_index(inplace=True)
return df_ana
df_ana = wrangle_ana_df()
#EXPORT DF_ANA
df_ana.to_csv(r'Export\file\path\of\df_ana.csv', index=False)
'''
Now all of the data is in one dataframe we can use as the primary source for
analysis. Some further wrangling will still be needed, but this was the heaviest lifting.
'''
'''
First, isolate this year's data into a separate dataframe, because the rest of
the data covers full 16-game regular seasons and, at this time, the 2021 season
has not yet played all of its games.
'''
In[]: df_ana_2021 = df_ana.iloc[-32:,0:]
In[]: df_ana_d = df_ana.iloc[0:670, 0:]
In[]: df_ana_2021.reset_index(drop=True, inplace=True)
'''
We can start by looking at some descriptive statistics of the data in df_ana.
I am going to exclude the division, conference, and super bowl data for now.
'''
In[]: df_ana.iloc[0:,8:].describe()
#Here is a snippet of the output
W_standings L_standings W_-_L_%_standings C_m_p_passing \
count 670.000000 670.000000 670.000000 670.000000
mean 7.983582 7.983582 0.498746 331.092537
std 3.101863 3.098005 0.193672 48.968711
min 0.000000 0.000000 0.000000 204.000000
25% 6.000000 6.000000 0.375000 297.250000
50% 8.000000 8.000000 0.500000 329.000000
75% 10.000000 10.000000 0.625000 366.000000
max 16.000000 16.000000 0.938000 472.000000
A_t_t_passing C_m_p_%_passing Y_d_s_passing T_D_passing \
count 670.000000 670.000000 670.000000 670.00000
mean 540.602985 61.103582 3555.582090 23.08209
std 59.762089 4.392035 620.522169 7.45294
min 358.000000 45.600000 1898.000000 6.00000
25% 501.000000 58.000000 3098.500000 18.00000
50% 542.500000 61.100000 3536.500000 22.00000
75% 583.000000 64.300000 3998.000000 27.75000
max 740.000000 73.400000 5444.000000 55.00000 ...
'''
The above shows the count, mean (average), standard deviation (std), min, max,
and percentiles for each variable. This covers the entire data set, but we can
also isolate data by year.
'''
df_2000 = df_ana.loc[df_ana['Yr']==2000]
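'''
If you want one summary per season instead of slicing year by year, groupby
does it in one call (sketched here on a toy stand-in for df_ana):
'''

```python
import pandas as pd

# Toy stand-in for df_ana: two seasons, two teams each, hypothetical values
df_ana_toy = pd.DataFrame({
    'Yr': [2000, 2000, 2001, 2001],
    'Tm': ['A', 'B', 'C', 'D'],
    'W_-_L_%_standings': [0.688, 0.500, 0.813, 0.250],
})

# One row of descriptive statistics per season
per_year = df_ana_toy.groupby('Yr')['W_-_L_%_standings'].describe()
print(per_year[['count', 'mean']])
```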
'''
Here we will generate a correlation matrix, starting at the wins and losses columns.
I am going to exclude the division, conference, and super bowl data for now.
'''
In[]: df_corr = df_ana.iloc[:,8:].corr(method='pearson')
'''
So if we look at the first ten rows and four columns we can see the structure of a
correlation matrix.
'''
W_standings L_standings W_-_L_%_standings C_m_p_passing
W_standings 1.000000 -0.998346 0.979706 0.122332
L_standings -0.998346 1.000000 -0.979749 -0.127578
W_-_L_%_standings 0.979706 -0.979749 1.000000 0.113643
C_m_p_passing 0.122332 -0.127578 0.113643 1.000000
A_t_t_passing -0.114731 0.109959 -0.118299 0.886518
C_m_p_%_passing 0.429580 -0.433127 0.418009 0.697468
Y_d_s_passing 0.313159 -0.317413 0.300767 0.872765
T_D_passing 0.502388 -0.503220 0.475131 0.610782
T_D_%_passing 0.591888 -0.591131 0.565948 0.327591
I_n_t_passing -0.473305 0.476590 -0.465494 -0.078404
'''
It shows the correlation coefficients for each variable against the others.
So, let us look at all the coefficients for W_-_L_%_standings.
'''
In[]: w_loss_per_corr = df_corr.iloc[0:, 2]
Out[]:
W_standings 0.979706
L_standings -0.979749
W_-_L_%_standings 1.000000
C_m_p_passing 0.113643
A_t_t_passing -0.118299
C_m_p_%_passing 0.418009
Y_d_s_passing 0.300767
T_D_passing 0.475131
T_D_%_passing 0.565948
I_n_t_passing -0.465494
I_n_t_%_passing -0.422179
L_n_g_passing 0.066474
Y_/_A_passing 0.512011
A_Y_/_A_passing 0.591349
Y_/_C_passing 0.327409
Y_/_G_passing 0.300762
R_a_t_e_passing 0.589854
S_k_passing -0.456082
Y_d_s_._1_passing -0.453944
S_k_%_passing -0.368286
N_Y_/_A_passing 0.569628
A_N_Y_/_A_passing 0.619554
4_Q_C_passing 0.329574
G_W_D_passing 0.375747
E_X_P_passing 0.560876
0-19_FGA_kicking -0.005201
0-19_FGM_kicking -0.018923
20-29_FGA_kicking 0.225080
20-29_FGM_kicking 0.230720
30-39_FGA_kicking 0.209211
30-39_FGM_kicking 0.216945
40-49_FGA_kicking 0.125072
40-49_FGM_kicking 0.151656
50+_FGA_kicking -0.122340
50+_FGM_kicking -0.080407
Scoring_FGA_kicking 0.274723
Scoring_FGM_kicking 0.310597
Scoring_Lng_kicking -0.007467
Scoring_FG%_kicking 0.187031
Scoring_XPA_kicking 0.674673
Scoring_XPM_kicking 0.675618
Scoring_XP%_kicking 0.057681
Kickoffs_KO_kicking 0.567831
Kickoffs_KOYds_kicking 0.575851
Kickoffs_TB_kicking 0.148144
Kickoffs_TB%_kicking 0.025797
Kickoffs_KOAvg_kicking 0.222601
Punting_Pnt_kicking -0.374634
Punting_Yds_kicking -0.386927
Punting_Lng_kicking -0.027955
Punting_Blck_kicking -0.019960
Punting_Y/P_kicking -0.066211
Punt Returns_Ret_returns 0.142696
Punt Returns_Yds_returns 0.171174
Punt Returns_TD_returns 0.047896
Punt Returns_Lng_returns 0.069457
Punt Returns_Y/R_returns 0.089882
Kick Returns_Rt_returns -0.259872
Kick Returns_Yds_returns -0.238477
Kick Returns_TD_returns 0.020882
Kick Returns_Lng_returns 0.031521
Kick Returns_Y/Rt_returns 0.060090
APYd_returns 0.353581
A_t_t_rushing 0.466039
Y_d_s_rushing 0.381405
T_D_rushing 0.480960
L_n_g_rushing -0.011970
Y_/_A_rushing 0.140499
Y_/_G_rushing 0.381364
F_m_b_rushing -0.306978
E_X_P_rushing 0.022537
Downs_3DAtt_conversions -0.141215
Downs_3DConv_conversions 0.397804
Downs_3D%_conversions 0.497981
Downs_4DAtt_conversions -0.451078
Downs_4DConv_conversions -0.172604
Downs_4D%_conversions 0.217802
Red Zone_RZAtt_conversions 0.597728
Red Zone_RZTD_conversions 0.587456
Red Zone_RZPct_conversions 0.379994
#Dr_drives -0.110206
Plays_drives 0.276659
Sc%_drives 0.643822
TO%_drives -0.504452
Average Drive_Plays_drives 0.287888
Average Drive_Yds_drives 0.455470
Average Drive_Start_drives 0.390743
Average Drive_Pts_drives 0.638762
R_s_h_T_D_scoring 0.480960
R_e_c_T_D_scoring 0.475131
P_R_ _T_D_scoring 0.021229
K_R_ _T_D_scoring -0.100318
F_b_l_T_D_scoring 0.066160
I_n_t_T_D_scoring 0.225410
O_t_h_T_D_scoring 0.159459
A_l_l_T_D_scoring 0.656068
2_P_M_scoring 0.092491
2_P_A_scoring -0.086518
X_P_M_scoring 0.675618
X_P_A_scoring 0.674673
F_G_M_scoring 0.310597
F_G_A_scoring 0.274723
S_f_t_y_scoring 0.119496
P_t_s_scoring 0.703648
P_t_s_/_G_scoring 0.703709
D_2_P_scoring NaN
PF_team_offense 0.703685
Yds_team_offense 0.497314
Tot Yds & TO_Ply_team_offense 0.255818
Tot Yds & TO_Y/P_team_offense 0.485444
Tot Yds & TO_TO_team_offense -0.487077
FL_team_offense -0.298245
1stD_team_offense 0.473765
Passing_Cmp_team_offense 0.113643
Passing_Att_team_offense -0.118299
Passing_Yds_team_offense 0.300767
Passing_TD_team_offense 0.475131
Passing_Int_team_offense -0.465494
Passing_NY/A_team_offense 0.569628
Passing_1stD_team_offense 0.279824
Rushing_Att_team_offense 0.466039
Rushing_Yds_team_offense 0.381405
Rushing_TD_team_offense 0.480960
Rushing_Y/A_team_offense 0.140499
Rushing_1stD_team_offense 0.434559
Penalties_Pen_team_offense -0.113117
Penalties_Yds_team_offense -0.078770
Penalties_1stPy_team_offense 0.103788
Sc%_team_offense 0.643822
TO%_team_offense -0.504452
EXP_team_offense 0.456270
Avg_Drive_Time_Sec_drives 0.418059
Name: W_-_L_%_standings, dtype: float64
'''
So, let us look at this year's data to see the variables with a coefficient above
.6 for W_-_L_%_standings.
'''
In[]: df_corr_2021 = df_ana_2021.iloc[:,8:].corr(method='pearson')
In[]: df_corr_win_loss_per_2021 = df_corr_2021[['W_-_L_%_standings']]
In[]: df_corr_win_loss_per_2021 = df_corr_win_loss_per_2021.loc[df_corr_win_loss_per_2021['W_-_L_%_standings'] > .6]
In[]: df_corr_win_loss_per_2021
Out[]:
W_-_L_%_standings
W_standings 0.995154
W_-_L_%_standings 1.000000
Y_d_s_passing 0.622321
T_D_passing 0.634460
T_D_%_passing 0.637850
Y_/_A_passing 0.656540
A_Y_/_A_passing 0.696654
Y_/_G_passing 0.619605
R_a_t_e_passing 0.702980
N_Y_/_A_passing 0.731743
A_N_Y_/_A_passing 0.716489
E_X_P_passing 0.791657
Scoring_XPA_kicking 0.787438
Scoring_XPM_kicking 0.796545
Kickoffs_KO_kicking 0.773950
Kickoffs_KOYds_kicking 0.799903
Kickoffs_TB_kicking 0.636094
Kickoffs_KOAvg_kicking 0.679450
APYd_returns 0.617156
Downs_3D%_conversions 0.640967
Red Zone_RZAtt_conversions 0.788430
Red Zone_RZTD_conversions 0.783980
Sc%_drives 0.850716
Average Drive_Yds_drives 0.722622
Average Drive_Pts_drives 0.849570
R_e_c_T_D_scoring 0.634460
A_l_l_T_D_scoring 0.754195
X_P_M_scoring 0.796545
X_P_A_scoring 0.787438
P_t_s_scoring 0.800817
P_t_s_/_G_scoring 0.834334
PF_team_offense 0.800817
Yds_team_offense 0.732371
Tot Yds & TO_Y/P_team_offense 0.679825
1stD_team_offense 0.727823
Passing_Yds_team_offense 0.622321
Passing_TD_team_offense 0.634460
Passing_NY/A_team_offense 0.731743
Sc%_team_offense 0.850716
EXP_team_offense 0.825778
'''
One figure that really sticks out is Sc%, the percentage of drives ending in an
offensive score, and this makes a lot of sense: for a team to win consistently,
it needs to be good at moving the ball down the field and then scoring.
The other obvious correlations all relate to scoring, which also makes sense,
because to win a team must score.
'''
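'''
Rather than scanning the coefficient column by eye, you can rank the variables
programmatically by absolute correlation with the target. A minimal sketch on
hypothetical data; with df_ana, the same pattern would be
df_corr['W_-_L_%_standings'].abs().sort_values(ascending=False).
'''

```python
import pandas as pd

# Toy frame: rank columns by absolute Pearson correlation with the target
toy = pd.DataFrame({
    'wl_pct': [0.25, 0.50, 0.75, 1.00],
    'points': [200, 320, 390, 480],   # rises with winning
    'fumbles': [30, 22, 18, 10],      # falls with winning
    'longest_fg': [52, 49, 55, 51],   # little relationship
})

corr = toy.corr(method='pearson')['wl_pct'].drop('wl_pct')
ranked = corr.abs().sort_values(ascending=False)
print(ranked.index.tolist())  # ['points', 'fumbles', 'longest_fg']
```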